Round 1 - Technical
🔹 Write PySpark code to save a DataFrame to AWS S3 in Parquet format.
🔹 How do you overwrite a file stored in S3 using PySpark? (A sketch covering both of these questions follows this list.)
🔹 Explain versioning in S3.
🔹 Write an SQL query to generate the given output.
🔹 What are the steps to execute a Python file containing PySpark code on an AWS EC2 environment?
🔹 How do you copy a file from the local system to AWS S3 without using the upload feature of the S3 bucket? (See the boto3 sketch after this list.)
🔹 If you execute the same query in Snowflake and Spark, which one takes less time?
🔹 In the above scenario, which one will be costlier?
🔹 Have you worked on any cost optimization methods while loading data into Snowflake from a data lake like S3?
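
For the first two questions, a minimal PySpark sketch along the lines of the expected answer, assuming the cluster already has the S3A connector and AWS credentials configured; the bucket and paths below are placeholders:

```python
from pyspark.sql import SparkSession

# Assumes the hadoop-aws/S3A connector is on the classpath and credentials are
# available via the standard chain (IAM role, environment variables, etc.).
spark = SparkSession.builder.appName("write_to_s3").getOrCreate()

df = spark.read.csv("s3a://my-bucket/input/data.csv", header=True, inferSchema=True)

# Save as Parquet; mode("overwrite") replaces any existing data at the target path,
# which is also the usual answer to "how do you overwrite a file in S3 with PySpark".
df.write.mode("overwrite").parquet("s3a://my-bucket/output/data_parquet/")

spark.stop()
```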
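For the copy-without-the-console question, the usual answers are the AWS CLI (`aws s3 cp`) or the boto3 SDK. A minimal boto3 sketch, with hypothetical file, bucket, and key names:

```python
import boto3

# Credentials are assumed to come from the standard AWS credential chain
# (environment variables, ~/.aws/credentials, or an attached IAM role).
s3 = boto3.client("s3")
s3.upload_file(
    Filename="/tmp/report.parquet",   # local file to copy
    Bucket="my-bucket",               # target S3 bucket
    Key="reports/report.parquet",     # object key inside the bucket
)
```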
Round 2 - Techno-Managerial
This round was conducted in two phases and aimed to assess my knowledge of Spark, AWS, Snowflake, Python, and SQL. The questions were scenario-specific, focusing on AWS data engineering services such as Glue, Lambda, EC2, S3, Redshift, and Athena. Key topics included:
🔹 Deep dive into Spark's memory distribution for processing a 500 GB file (an illustrative configuration sketch follows this list).
🔹 Architecture-level questions.
🔹 Writing an entire PySpark program, from the import statements to the final `spark.stop()` call (a minimal skeleton follows this list).
🔹 Testing SQL skills with window functions using `LAG`, `LEAD`, and `DENSE_RANK` (an example query follows this list).
🔹 Questions on Spark optimization techniques.
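
For the 500 GB discussion, the conversation typically revolves around executor memory, cores, and shuffle partitions. A sketch of how such settings can be supplied when building the SparkSession; the numbers are placeholders, not recommendations:

```python
from pyspark.sql import SparkSession

# Illustrative settings only -- the right values depend on cluster size,
# file format, and the transformations involved.
spark = (
    SparkSession.builder
    .appName("large_file_job")
    .config("spark.executor.memory", "8g")          # heap per executor
    .config("spark.executor.memoryOverhead", "2g")  # off-heap/overhead per executor
    .config("spark.executor.cores", "4")            # concurrent tasks per executor
    .config("spark.sql.shuffle.partitions", "800")  # partitions after wide transformations
    .getOrCreate()
)

# A ~500 GB dataset read this way is split into many partitions; each task
# processes one partition, so no single executor holds the whole file in memory.
df = spark.read.parquet("s3a://my-bucket/big-dataset/")
```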
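A minimal end-to-end skeleton of the kind of program that was asked for, with hypothetical paths and column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def main():
    # 1. Create the SparkSession (entry point for the DataFrame API).
    spark = SparkSession.builder.appName("end_to_end_example").getOrCreate()

    # 2. Read the source data (hypothetical path and schema).
    orders = spark.read.csv("s3a://my-bucket/raw/orders.csv", header=True, inferSchema=True)

    # 3. Transform: keep completed orders and aggregate revenue per customer.
    revenue = (
        orders
        .filter(F.col("status") == "COMPLETED")
        .groupBy("customer_id")
        .agg(F.sum("amount").alias("total_amount"))
    )

    # 4. Write the result back to S3 as Parquet.
    revenue.write.mode("overwrite").parquet("s3a://my-bucket/curated/revenue/")

    # 5. Release resources.
    spark.stop()

if __name__ == "__main__":
    main()
```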
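An example of the window-function style of query that was tested, written here through `spark.sql` so it stays runnable in PySpark; the `sales` table and its columns are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("window_functions").getOrCreate()

# Hypothetical sales data registered as a temp view for SQL access.
spark.read.parquet("s3a://my-bucket/curated/sales/").createOrReplaceTempView("sales")

result = spark.sql("""
    SELECT
        employee_id,
        sale_date,
        amount,
        LAG(amount)  OVER (PARTITION BY employee_id ORDER BY sale_date) AS prev_amount,
        LEAD(amount) OVER (PARTITION BY employee_id ORDER BY sale_date) AS next_amount,
        DENSE_RANK() OVER (PARTITION BY employee_id ORDER BY amount DESC) AS amount_rank
    FROM sales
""")
result.show()
```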
Round 3 - HR Discussion
🔹 Team culture discussion.
🔹 Leave and holiday policies.
🔹 Work culture discussion.
🔹 Salary negotiation.
🔹 Variable pay component discussion.
I received the offer letter after the final HR round.😊